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AIS-BN: An Adaptive Importance Sampling Algorithm for 
Evidential Reasoning in Large Bayesian Networks 



Stochastic sampling algorithms, while an attractive alternative to exact algorithms in 
very large Bayesian network models, have been observed to perform poorly in evidential 
reasoning with extremely unlikely evidence. To address this problem, we propose an adap- 
tive importance sampling algorithm, AIS-BN, that shows promising convergence rates 
even under extreme conditions and seems to outperform the existing sampling algorithms 
consistently. Three sources of this performance improvement are (1) two heuristics for 
initiahzation of the importance function that are based on the theoretical properties of im- 
portance sampling in finite-dimensional integrals and the structural advantages of Bayesian 
networks, (2) a smooth learning method for the importance function, and (3) a dynamic 
weighting function for combining samples from different stages of the algorithm. 

We tested the performance of the AIS-BN algorithm along with two state of the art 
general purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter 
& Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). We used in our 
tests three large real Bayesian network models available to the scientific community: the 
CPCS network (Pradhan et al., 1994), the PathFinder network (Heckerman, Horvitz, 
& Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 
1997), with evidence as unlikely as 10^''^. While the AIS-BN algorithm always performed 
better than the other two algorithms, in the majority of the test cases it achieved orders of 
magnitude improvement in precision of the results. Improvement in speed given a desired 
precision is even more dramatic, although we are unable to report numerical results here, 
as the other algorithms almost never achieved the precision reached even by the first few 
iterations of the AIS-BN algorithm. 

1. Introduction 

Bayesian networks (Pearl, 1988) are increasingly popular tools for modeling uncertainty in 
intelligent systems. With practical models reaching the size of several hundreds of variables 
(e.g., Pradhan et al., 1994; Conati et al., 1997), it becomes increasingly important to ad- 
dress the problem of feasibility of probabilistic inference. Even though several ingenious 
exact algorithms have been proposed, in very large models they all stumble on the theo- 
retically demonstrated NP-hardness of inference (Cooper, 1990). The significance of this 
result can be observed in practice — exact algorithms applied to large, densely connected 
practical networks require either a prohibitive amount of memory or a prohibitive amount 
of computation and are unable to complete. While approximating inference to any desired 
precision has been shown to be NP-hard as well (Dagum &; Luby, 1993), it is for very com- 
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plex networks the only alternative that will produce any result at all. Furthermore, while 
obtaining the result is crucial in all applications, precision guarantees may not be critical 
for some types of problems and can be traded off against the speed of computation. 

A prominent subclass of approximate algorithms is the family of stochastic sampling 
algorithms (also called stochastic simulation or Monte Carlo algorithms). The precision 
obtained by stochastic sampling generally increases with the number of samples generated 
and is fairly unaffected by the network size. Execution time is fairly independent of the 
topology of the network and is linear in the number of samples. Computation can be 
interrupted at any time, yielding an anytime property of the algorithms, important in time- 
critical applications. 

While stochastic sampling performs very well in predictive inference, diagnostic reason- 
ing, i.e., reasoning from observed evidence nodes to their ancestors in the network often 
exhibits poor convergence. When the number of observations increases, especially if these 
observations are unlikely a-priori, stochastic sampling often fails to converge to reason- 
able estimates of the posterior probabilities. Although this problem has been known since 
the very first sampling algorithm was proposed by Henrion (1988), little has been done 
to address it effectively. Furthermore, various sampling algorithms proposed were tested 
on simple and small networks, or networks with special topology, without the presence of 
extremely unlikely evidence and the practical significance of this problem has been un- 
derestimated. Given a typical number of samples used in real-time that are feasible on 
today's hardware, say 10*^ samples, the behavior of a stochastic sampling algorithm will be 
drastically different for different size networks. While in a network consisting of 10 nodes 
and a few observations, it may be possible to converge to exact probabilities, in very large 
networks only a negligibly small fraction of the total sample space will be probed. One of 
the practical Bayesian network models that we used in our tests, a subset of the CPCS 
network (Pradhan et al., 1994), consists of 179 nodes. Its total sample space is larger than 
10^^. With 10^ samples, we can sample only 10~^^ fraction of the sample space. 

We believe that it is crucial (1) to study the feasibility and convergence properties of 
sampling algorithms on very large practical networks, and (2) to develop sampling algo- 
rithms that will show good convergence under extreme, yet practical conditions, such as 
evidential reasoning given extremely unlikely evidence. After all, small networks can be 
updated using any of the existing exact algorithms — it is precisely the very large networks 
where stochastic sampling can be most useful. As to the likelihood of evidence, we know 
that stochastic sampling will generally perform well when it is high (Henrion, 1988). So, it 
is important to look at those cases in which evidence is very unlikely. In this paper, we test 
two existing state of the art stochastic sampling algorithms for Bayesian networks, likeli- 
hood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling 
(Shachter & Peot, 1989), on a subset of the CPCS network with extremely unlikely evi- 
dence. We show that they both exhibit similarly poor convergence rates. We propose a new 
sampling algorithm, that we call the adaptive importance sampling for Bayesian networks 
(AIS-BN), which is suitable for evidential reasoning in large multiply-connected Bayesian 
networks. The AIS-BN algorithm is based on importance sampling, which is a widely 
applied method for variance reduction in simulation that has also been applied in Baye- 
sian networks (e.g., Shachter &: Peot, 1989). We demonstrate empirically on three large 
practical Bayesian network models that the AIS-BN algorithm consistently outperforms 



156 



Adaptive Importance Sampling in Bayesian Networks 



the other two algorithms. In the majority of the test cases, it achieved over two orders of 
magnitude improvement in convergence. Improvement in speed given a desired precision 
is even more dramatic, although we are unable to report numerical results here, as the 
other algorithms never achieved the precision reached even by the first few iterations of 
the AIS-BN algorithm. The main sources of improvement are: (1) two heuristics for the 
initialization of the importance function that are based on the theoretical properties of im- 
portance sampling in finite-dimensional integrals and the structural advantages of Bayesian 
networks, (2) a smooth learning method for updating the importance function, and (3) a 
dynamic weighting function for combining samples from different stages of the algorithm. 
We study the value of the two heuristics used in the AIS-BN algorithm: (1) initialization 
of the probability distributions of parents of evidence nodes to uniform distribution and 
(2) adjusting very small probabilities in the conditional probability tables, and show that 
they both play an important role in the AIS-BN algorithm but only a moderate role in the 
existing algorithms. 

The remainder of this paper is structured as follows. Section 2 first gives a general 
introduction to importance sampling in the domain of finite-dimensional integrals, where it 
was originally proposed. We show how importance sampling can be used to compute prob- 
abilities in Bayesian networks and how it can draw additional benefits from the graphical 
structure of the network. Then we develop a generalized sampling scheme that will aid us 
in reviewing the previously proposed sampling algorithms and in describing the AIS-BN 
algorithm. Section 3 describes the AIS-BN algorithm. We propose two heuristics for ini- 
tialization of the importance function and discuss their theoretical foundations. We describe 
a smooth learning method for the importance function and a dynamic weighting function 
for combining samples from different stages of the algorithm. Section 4 describes the em- 
pirical evaluation of the AIS-BN algorithm. Finally, Section 5 suggests several possible 
improvements to the AIS-BN algorithm, possible applications of our learning scheme, and 
directions for future work. 

2. Importance Sampling Algorithms for Bayesian Networks 

We feel that it is useful to go back to the theoretical roots of importance sampling in order 
to be able to understand the source of speedup of the AIS-BN algorithm relative to the 
existing state of the art importance sampling algorithms for Bayesian networks. We first 
review the general idea of importance sampling in finite-dimensional integrals and how it 
can reduce the sampling variance. We then discuss the application of importance sampling 
to Bayesian networks. Readers interested in more details are directed to literature on 
Monte Carlo methods in computation of finite integrals, such as the excellent exposition by 
Rubinstein (1981) that we are essentially following in the first section. 

2.1 Mathematical Foundations 

Let ^(X) be a function of m variables X = {Xi, X^) over a domain Q C R"^, such that 
computing g(X) for any X is feasible. Consider the problem of approximate computation 
of the integral 




(1) 
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Importance sampling approaches this problem by writing the integral (1) as 

where /(X), often referred to as the importance function, is a probability density function 
over il. /(X) can be used in importance sampling if there exists an algorithm for generating 
samples from /(X) and if the importance function is zero only when the original function 
is zero, i.e., ^(X) / =^ /(X) # 0. 

After we have independently sampled n points Si, S2, ■ ■ ■ , s„, Sj G Jl, according to the 
probability density function /(X), we can estimate the integral / by 

and estimate the variance of /„ by 

(3) 

It is straightforward to show that this estimator has the following properties: 

1. E{in) = I 

2. lim^_^oo In — I 

3. ^/n ■ {in — I) Normal(0, (Tj^jj^^), where 



2 



4. E(d^in)) =a^in)=cjj^y^^/n 

The variance of is proportional to crj(x) inversely proportional to the number of 
samples. To minimize the variance of /„, we can either increase the number of samples or 
try to decrease o"j(x)- With respect to the latter, Rubinstein (1981) reports the following 
useful theorem and corollary. 

Theorem 1 The minimum of Cj^^) *^ equal to 



-?(x)=(/j9(X)|dx)' 



and occurs when X is distributed according to the following probability density function 

b(x)| 



/(X) 



/nb(X)|^iX 
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Corollary 1 If g(X.) >0, then the optimal probability density function is 

/(X) = '-^ 

and o"j^x) ~ 0- 

Although in practice sampling from precisely /(X) = g(X.)/I will occur rarely, we expect 
that functions that are close enough to it can still reduce the variance effectively. Usually, 
the closer the shape of the function /(X) is to the shape of the function ^(X), the smaller 
is Cj(x)' high-dimensional integrals, selection of the importance function, /(X), is far 
more critical than increasing the number of samples, since the former can dramatically 
affect Cj(x)- seems prudent to put more energy in choosing an importance function 
whose shape is as close as possible to that of g(X) than to apply the brute force method of 
increasing the number of samples. 

It is worth noting here that if /(X) is uniform, importance sampling becomes a general 
Monte Carlo sampling. Another noteworthy property of importance sampling that can be 
derived from Equation 4 is that we should avoid /(X) <^ |9(X) —I-f{X.)\ in any part 
of the domain of sampling, even if /(X) matches well g{X.)/I in important regions. If 
/(X) ^ blX) — /-/(X)!, the variance can become very large or even infinite. We can 
avoid this by adjusting /(X) to be larger in unimportant regions of the domain of X. 

While in this section we discussed importance sampling for continuous variables, the 
results stated are valid for discrete variables as well, in which case integration should be 
substituted by summation. 

2.2 A Generic Importance Sampling Algorithm for Bayesian Networks 

In the following discussion, all random variables used are multiple- valued, discrete variables. 
Capital letters, such as A, B, or C, denote random variables. Bold capital letters, such as 
A, B, or C, denote sets of variables. Bold capital letter E will usually be used to denote 
the set of evidence variables. Lower case letters a, 6, c denote particular instantiations 
of variables A, B, and C respectively. Bold lower case letters, such as a, b, c, denote 
particular instantiations of sets A, B, and C respectively. Bold lower case letter e, in 
particular, will be used to denote the observations, i.e., instantiations of the set of evidence 
variables E. Anc(^) denotes the set of ancestors of node A. Pa{A) denotes the set of 
parents (direct ancestors) of node A. pa(j4) denotes a particular instantiation oiP8i{A). \ 
denotes set difference. Pa(A)|j,^g denotes that we use the extended vertical bar to indicate 
substitution of e for E in A. 

We know that the joint probability distribution over all variables of a Bayesian net- 
work model, Pr(X), is the product of the probability distributions over each of the nodes 
conditional on their parents, i.e., 

n 

Pr(X) = nPr(^i|Pa(X,)) . (5) 

1=1 

In order to calculate Pr(E = e), we need to sum over all Pr(X\E, E = e). 

Pr(E = e) = ^ Pr(X\E, E = e) (6) 

X\E 
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We can see that Equation 6 is almost identical to Equation 1 except that integration is 
replaced by summation and the domain 0, is replaced by X\E. The theoretical results 
derived for the importance sampling that we reviewed in the previous section can thus be 
directly applied to computing probabilities in Bayesian networks. 

While there has been previous work on importance sampling-based algorithms for Ba- 
yesian networks, we will postpone the discussion of this work until the next section. Here 
we will present a generic stochastic sampling algorithm that will help us in both reviewing 
the prior work and in presenting our algorithm. 

The posterior probability Pr(a|e) can be obtained by first computing Pr(a, e) and Pr(e) 
and then combining these based on the definition of conditional probability 

„ , , , Pr(a, e) 

In order to increase the accuracy of results of importance sampling in computing the pos- 
terior probabilities over different network variables given evidence, we should in general use 
different importance functions for Pr(a, e) and for Pr(e). Doing so increases the computa- 
tion time only linearly while the gain in accuracy may be significant given that obtaining 
a desired accuracy is exponential in nature. Very often, it is a common practice to use the 
same importance function (usually for Pr(e)) to sample both probabilities. If the difference 



1. Order the nodes according to their topological order. 

2. Initialize importance function Pr''(X\E), the desired number of samples 
m, the updating interval and the score arrays for every node. 

3. A; ^ 0, T ^ 

4. for « 1 to m do 

5. if {i mod / == 0) then 

6. k^k + l 

7. Update importance function Pr'^(X\E) based on T. 
end if 

8. Si <r- generate a sample according to Pr*'(X\E) 

9. T^TU{si} 

10. Calculate Score(s.j, Pr(X\E, e), Pr'^(X\E)) and add it to the correspond- 
ing entry of every score array according to the instantiated states. 

end for 

11. Normalize the score arrays for every node. 



Figure 1: A generic importance sampling algorithm. 
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between the optimal importance functions for these two quantities is large, the perfor- 
mance may deteriorate significantly. Although Pr(a, e) and Pr(e) are unbiased estimators 
according to Property 1 (Section 2.1), Pr(a|e) obtained by means of Equation 7 is not an 
unbiased estimator. However, as the number of samples increases, the bias decreases and 
can be ignored altogether when the sample size is large enough (Fishman, 1995). 

Figure 1 presents a generic stochastic sampling algorithm that captures most of the 
existing sampling algorithms. Without the loss of generality, we restrict ourselves in our 
description to so-called forward sampling, i.e., generation of samples in the topological 
order of the nodes in the network. The forward sampling order is accomplished by the 
initialization performed in Step 1, where parents of each node are placed before the node 
itself. In forward sampling. Step 8 of the algorithm, the actual generation of samples, works 
as follows, {i) each evidence node is instantiated to its observed state and is further omitted 
from sample generation; (ii) each root node is randomly instantiated to one of its possible 
states, according to the importance prior probability of this node, which can be derived 
from Pr'^(X\E); {in) each node whose parents are instantiated is randomly instantiated to 
one of its possible states, according to the importance conditional probability distribution 
of this node given the values of the parents, which can also be derived from Pr^(X\E); (iv) 
this procedure is followed until all nodes are instantiated. A complete instantiation Sj of the 
network based on this method is one sample of the joint importance probability distribution 
Pr'^(X\E) over all variables of the network. The scoring of Step 10 amounts to calculating 
Pr(s2, e)/Pr^(si), as required by Equation 2. The ratio between the total score sum and the 
number of samples is an unbiased estimator of Pr(e). In Step 10, if we also count the score 
sum under the condition A = a, i.e., that some unobserved variables A have the values a, 
the ratio between this score sum and the number of samples is an unbiased estimator of 
Pr(a, e). 

Most existing algorithms focus on the posterior probability distributions of individual 
nodes. As we mentioned above, for the sake of efficiency they count the score sum corre- 
sponding to Pr(^ = a,e), A G X\E, and record it in an score array for node A. Each 
entry of this array corresponds to a specified state of A. This method introduces additional 
variance, as opposed to using the importance function derived from Pr*'(X\E) to sample 
Pr(^ = a,e), A e X\E, directly. 

2.3 Existing Importance Sampling Algorithms for Bayesian Networks 

The main difference between various stochastic sampling algorithms is in how they process 
Steps 2, 7, and 8 in the generic importance sampling algorithm of Figure 1. 

Probabilistic logic sampling (Henrion, 1988) is the simplest and the first proposed sam- 
pling algorithm for Bayesian networks. The importance function is initialized in Step 2 to 
Pr(X) and never updated (Step 7 is null). Without evidence, Pr(X) is the optimal im- 
portance function for the evidence set, which is empty anyway. It escapes most authors 
that Pr(X) may be not the optimal importance function for Pi{A = a), ^4 G X, when A 
is not a root node. A mismatch between the optimal and the actually used importance 
function may result in a large variance. The sampling process with evidence is the same 
as without evidence except that in Step 10 we do not count the scores for those samples 
that are inconsistent with the observed evidence, which amounts to discarding them. When 
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the evidence is very unlikely, there is a large difference between Pr(X) and the optimal 
importance function. Effectively, most samples are discarded and the performance of logic 
sampling deteriorates badly. 

Likelihood weighting (LW) (Rmg & Chang, 1989; Shachter &: Peot, 1989) enhances the 
logic sampling in that it never discards samples. In likelihood weighting, the importance 
function in Step 2 is 

Pr(X\E) = n Pr{xi\Psi{Xi)) 

Likelihood weighting does not update the importance function in Step 7. Although likeli- 
hood weighting is an improvement on logic sampling, its convergence rate can be still very 
slow when there is large difference between the optimal importance function and Pr(X\E), 
again especially in situations when evidence is very unlikely. Because of its simplicity, the 
likelihood weighting algorithm has been the most commonly used simulation method for 
Bayesian network inference. It often matches the performance of other, more sophisticated 
schemes because it is simple and able to increase its precision by generating more samples 
than other algorithms in the same amount of time. 

Backward sampling (Fung & del Favero, 1994) changes Step 1 of our generic algorithm 
and allows for generating samples from evidence nodes in the direction that is opposite to 
the topological order of nodes in the network. In Step 2, backward sampling uses the likeli- 
hood of some of the observed evidence and some instantiated nodes to calculate Pr''(X\E). 
Although Fung and del Favero mentioned the possibility of dynamic node ordering, they 
did not propose any scheme for updating the importance function in Step 7. Backward 
sampling suffers from problems that are similar to those of likelihood weighting, i.e., a pos- 
sible mismatch between its importance function and the optimal importance function can 
lead to poor convergence. 

Importance sampling (Shachter &; Peot, 1989) is the same as our generic sampling algo- 
rithm. Shachter and Peot introduced two variants of importance sampling: self-importance 
(SIS) and heuristic importance. The importance function used in the first step of the 
self-importance algorithm is 



Pr°(X\E) = n Pr(a:^|Pa(X,)) 



E=e 



This function is updated in Step 7. The algorithm tries to revise the conditional probability 
tables (CPTs) periodically in order to make the sampling distribution gradually approach 
the posterior distribution. Since the same data are used to update the importance function 
and to compute the estimator, this process introduces bias in the estimator. Heuristic 
importance first removes edges from the network until it becomes a polytree, and then 
uses a modified version of the polytree algorithm (Pearl, 1986) to compute the likelihood 
functions for each of the unobserved nodes. Pr''(X\E) is a combination of these likelihood 
functions with Pr(X\E,e). In Step 7 heuristic importance does not update Pr'^(X\E). As 
Shachter and Peot (1989) point out, this heuristic importance function can still lead to a 
bad approximation of the optimal importance function. There exist also other algorithms 
such as a combination of self-importance and heuristic importance (Shachter Sz Peot, 1989; 
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Shwe &: Cooper, 1991). Although some researchers suggested that this may be a promising 
direction for the work on sampUng algorithms, we have not seen any results that would 
follow up on this. 

A separate group of stochastic sampling methods is formed by so-called Markov Chain 
Monte Carlo (MCMC) methods that are divided into Gibbs sampling. Metropolis sampling, 
and Hybrid Monte Carlo sampling (Geman & Geman, 1984; Gilks, Richardson, & Spiegel- 
halter, 1996; MacKay, 1998). Roughly speaking, these methods draw random samples from 
an unknown target distribution /(X) by biasing the search for this distribution towards 
higher probability regions. When applied to Bayesian networks (Pearl, 1987; Chavez & 
Cooper, 1990) this approach determines the sampling distribution of a variable from its 
previous sample given its Markov blanket (Pearl, 1988). This corresponds to updating 
Pr*^(X\E) when sampling every node. Pr*^(X\E) will converge to the optimal importance 
function for Pr(e) if Pr*'(X\E) satisfies some ergodic properties (York, 1992). Since the 
convergence to the limiting distribution is very slow and calculating updates of the sam- 
pling distribution is costly, these algorithms are not used in practice as often as the simple 
likelihood weighting scheme. 

There are also some other simulation algorithms, such as bounded variance algorithm 
(Dagum & Luby, 1997) and the A A algorithm (Dagum et al., 1995), which are essentially 
based on the LW algorithm and the Stopping- Rule Theorem (Dagum et al., 1995). Cano 
et al. (1996) proposed another importance sampling algorithm that performed somewhat 
better than LW in cases with extreme probability distributions, but, as the authors state, in 
general cases it "produced similar results to the likelihood weighting algorithm." Hernandez 
et al. (1998) also applied importance sampling and reported a moderate improvement on 
likelihood weighting. 

2.4 Practical Performance of the Existing Sampling Algorithms 

The largest network that has been tested using sampling algorithms is QMR-DT (Quick 
Medical Reference — Decision Theoretic) (Shwe et al., 1991; Shwe &: Cooper, 1991), which 
contains 534 adult diseases and 4,040 findings, with 40,740 arcs depicting disease-to-finding 
dependencies. The QMR-DT network belongs to a class of special bipartite networks 
and its structure is often referred to as BN20 (Henrion, 1991), because of its two-layer 
composition: disease nodes in the top layer and finding nodes in the bottom layer. Shwe 
and colleagues used an algorithm combining self-importance and heuristic importance and 
tested its convergence properties on the QMR-DT network. But since the heuristic method 
iterative tabular Bayes (ITB) that makes use of a version of Bayes' rule is designed for 
the BN20 networks, it cannot be generalized to arbitrary networks. Although Shwe and 
colleagues concluded that Markov blanket scoring and self-importance sampling significantly 
improve the convergence rate in their model, we cannot extend this conclusion to general 
networks. The computation of Markov blanket scoring is more complex in a general multi- 
connected network than in a BN20 network. Also, the experiments conducted lacked a 
gold-standard posterior probability distribution that could serve to judge the convergence 
rate. 

Pradhan and Dagum (1996) tested an efficient version of the LW algorithm — bounded 
variance algorithm (Dagum Sz Luby, 1997) and the AA algorithm (Dagum et al., 1995) on 
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a 146 node, multiply connected medical diagnostic Bayesian network. One limitation in 
their tests is that the probability of evidence in the cases selected for testing was rather 
high. Although over 10% of the cases had the probability of evidence on the order of 
10~^ or smaller, a simple calculation based on the reported mean /j, = 34.5 number of 
evidence nodes, shows that the average probability of an observed state of an evidence node 
conditional on its direct predecessors was on the order of (10-8)1/34.5 ^ o.59. Given that 
their algorithm is essentially based on the LW algorithm, based on our tests we suspect 
that the performance will deteriorate on cases where the evidence is very unlikely. Both 
algorithms focus on the marginal probability of one hypothesis node. If there are many 
queried nodes, the efficiency may deteriorate. 

We have tested the algorithms discussed in Section 2.3 on several large networks. Our 
experimental results show that in cases with very unlikely evidence, none of these algorithms 
converges to reasonable estimates of the posterior probabilities within a reasonable amount 
of time. The convergence becomes worse as the number of evidence nodes increases. Thus, 
when using these algorithms in very large networks, we simply cannot trust the results. We 
will present results of tests of the LW and SIS algorithms in more detail in Section 4. 

3. AIS-BN: Adaptive Importance Sampling for Bayesian Networks 

The main reason why the existing stochastic sampling algorithms converge so slowly is that 
they fail to learn a good importance function during the sampling process and, effectively, 
fail to reduce the sampling variance. When the importance function is optimal, such as 
in probabilistic logic sampling without any evidence, each of the algorithms is capable 
of converging to fairly good estimates of the posterior probabilities within relatively few 
samples. For example, assuming that the posterior probabilities are not extreme (i.e., larger 
than say 0.01), as few as 1,000 samples may be sufficient to obtain good estimates. In this 
section, we present the adaptive importance sampling algorithm for Bayesian networks 
(AIS-BN) that, as we will demonstrate in the next section, performs very well on most 
tests. We will first describe the details of the algorithm and prove two theorems that are 
useful in learning the optimal importance sampling function. 

3.1 Basic Algorithm — AIS-BN 

Compared with importance sampling used in normal finite-dimensional integrals, impor- 
tance sampling used in Bayesian networks has several significant advantages. First, the 
network joint probability distribution Pr(X) is decomposable and can be factored into 
component parts. Second, the network has a clear structure, which represents many con- 
ditional independence relationships. These properties are very helpful in estimating the 
optimal importance function. 

The basic AIS-BN algorithm is presented in Figure 2. The main differences between 
the AIS-BN algorithm and the basic importance sampling algorithm in Figure 1 is that 
we introduce a monotonically increasing weight function u;^ and two effective heuristic 
initialization methods in Step 2. We also introduce a special learning component in Step 7 
to let the updating process run more smoothly, avoiding oscillation of the parameters. The 
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1. 


Order the nodes according to their topological order. 




2. 


Initialize importance function Pr''(X\E) using some heuristic methods, ini- 




tialize weight w^, set the desired number of samples m 


and the updating 




iiiici vdi 6, liiiiidiiZic lilt; ocuit; diiitiyb lui t-vcij iiuLit;. 




3. 


^ U, ^ \y, WTScore ^ ^5 ^SUUl ^ ^ 




4. 


TO V 1 4 1 ^" TT) f1 




5. 


if (« mod / == 0) then 




6. 


ft — ft -|- i 




7. 


Update the importance function Pr'^(X\E) and w'^ 


based on T. 




end if 




8. 


Si generate a sample according to Pr'^(X\E) 




9. 


T^TU {s,} 




10. 


"^i5core ^ Score (s^, Pr(X\E,e), Pr'=(X\E), w'') 




11. 


WTScore ^ WTScore + WiScore 






(Optional: add WiScore to the corresponding entry of every score array) 


12. 


wsum ^ wsum + 






end for 




13. 


Output estimate of Pr(E) as WTScore/wsum 






(Optional: Normalize the score arrays for every node) 





Figure 2: The adaptive importance sampling for Bayesian Networks (AIS-BN) algorithm, 
score processing in Step 10 is 



'^iScore ^ 



Pr'=(s,) 

Note that in this respect the algorithm in Figure 1 becomes a special case of AIS-BN 
when w'^ = 1. The reason why we use u)*^ is that we want to give different weights to 
the sampling results obtained at different stages of the algorithm. As each stage updates 
the importance function, they will all have different distance from the optimal importance 
function. We recommend that w'^ oc 1/(J^, where ct*^ is the standard deviation estimated in 
stage k using Equation 3.^ In order to keep monotonically increasing, if w'' is smaller 
than w^~^, we adjust its value to w^~^. This weighting scheme may introduce bias into 



1. A similar weighting scheme based on variance was apparently developed independently by Ortiz and 
Kaelbling (2000), who recommend the weight w'' oc 1/(ct*)^. 
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the final result. Since the initial importance sampling functions are often inefficient and 
introduce big variance into the results, we also recommend that w'' = in the first few 
stages of the algorithm. We have designed this weighting scheme to reflect the fact that in 
practice estimates with very small estimated variance are usually good estimates. 

3.2 Modifying the Sampling Distribution in AIS-BN 

Based on the theoretical considerations of Section 2.1, we know that the crucial element of 
the algorithm is converging on a good approximation of the optimal importance function. 
In what follows, we first give the optimal importance function for calculating Pr(E = e) 
and then discuss how to use the structural advantages of Bayesian networks to approximate 
this function. In the sequel, we will use the symbol p to denote the importance sampling 
function and p* to denote the optimal importance sampling function. 
Since Pr(X\E, E = e) > 0, from Corollary 1 we have 

The following corollary captures this result. 

Corollary 2 The optimal importance sampling function p*(X\E) for calculating Pr(E = e) 
in Equation 6 is Pr(X|E = e). 

Although we know the mathematical expression for the optimal importance sampling 
function, it is difficult to obtain this function exactly. In our algorithm, we use the following 
importance sampling function 

n 

p(X\E) = nPr(^dPa(XO,E). (8) 

This function partially considers the effect of all the evidence on every node during the 
sampling process. When the network structure is the same as that of the network which 
has absorbed the evidence, this function is the optimal importance sampling function. It 
is easy to learn and, as our experimental results show, it is a good approximation to the 
optimal importance sampling function. Theoretically, when the posterior structure of the 
model changes drastically as the result of observed evidence, this importance sampling 
function may perform poorly. We have tried to find practical networks where this would 
happen, but to the day have not encountered a drastic example of this effect. 

Prom Section 2.2, we know that the score sums corresponding to pa(X;). e| can 
yield an unbiased estimator of Pr(a:.j, pa(Xj), e). According to the definition of conditional 
probability, we can get an estimator of Pr'(a;j|pa(Xi), e). This can be achieved by main- 
taining an updating table for every node, the structure of which mimicks the structure of 
the CPT. Such tables allow us to decompose the above importance function into compo- 
nents that can be learned individually. We will call these tables the importance conditional 
probability tables (ICPT). 

Definition 1 An importance conditional probability table (ICPT) of a node X is a table 
of posterior probabilities Pr(X|Pa(X),E = e) conditional on the evidence and indexed by 
its immediate predecessors, Pa(X). 



166 



Adaptive Importance Sampling in Bayesian Networks 



The ICPT tables will be modified during the process of learning the importance function. 
Now we will prove a useful theorem that will lead to considerable savings in the learning 
process. 

Theorem 2 



Proof: Suppose we have set the values of all the parents of node Xi to pa(Xj). Node Xi is 
dependent on evidence E given Y>^{Xi) only when Xi is c?-connecting with E given ^a.{Xi) 
(Pearl, 1988). According to the definition of d-connectivity, this happens only when there 
exists a member of Xis descendants that belongs to the set of evidence nodes E. In other 



Theorem 2 is very important for the AIS-BN algorithm. It states essentially that 
the ICPT tables of those nodes that are not ancestors of the evidence nodes are equal to 
the CPT tables throughout the learning process. We only need to learn the ICPT tables 
for the ancestors of the evidence nodes. Very often this can lead to significant savings in 
computation. If, for example, all evidence nodes are root nodes, we have our ICPT tables for 
every node already and the AIS-BN algorithm becomes identical to the likelihood weighting 
algorithm. Without evidence, the AIS-BN algorithm becomes identical to the probabilistic 
logic sampling algorithm. 

It is worth pointing out that for some Xj, Pr(Xj|P(j(Xj), E) (i.e., the ICPT table for 
Xi), can be easily calculated using exact methods. For example, when X-i is the only parent 
of an evidence node Ej and Ej is the only child of Xj, the posterior probability distribution 
of Xi is straightforward to compute exactly. Since the focus of the current paper is on 



Input: Initialized importance function Pr (X\E), learning rate ri{k). 
Output: An estimated importance function Pr'^(X\E). 
for stage A; to 5 do 



Xi G X,Xi ^ Anc(E) Pv{Xi\Pa.{Xi),B) = Pv{Xi\Pa.{Xi)) . 



(9) 



words Xi ^ Anc(E). 



□ 



1 



Sample / points sf , S2, . . . , independently according to the current im- 
portance function Pr'^(X\E). 



2 



For every node Xi such that Xi G X\E and Xj ^ Anc(E) count score sums 
corresponding to {xi,i)a{Xi),e} and estimate Pr'(2;2|pa(X2), e) based on sf , 




3 



Update Pr^(X\E) according to the following formula: 



{xi\pa.{Xi),e) = 



Pr^(x,|pa(X,),e) +7?(A;) ■ (Pr'(2;,|pa(X,), e) - Pr'=(2;,|pa(X,), e)) 



end for 



Figure 3: The AIS-BN algorithm for learning the optimal importance function. 
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sampling, the test results reported in this paper do not include this improvement of the 
AIS-BN algorithm. 

Figure 3 lists an algorithm that implements Step 7 of the basic AIS-BN algorithm listed 
in Figure 2. When we estimate Pi'{xi\p8i{Xi),e), we only use the samples obtained at the 
current stage. One reason for this is that the information obtained in previous stages has 
been absorbed by Pr'^(X\E). The other reason is that in principle, each successive iteration 
is more accurate than the previous one and the importance function is closer to the optimal 
importance function. Thus, the samples generated by Pr^+^(X\E) are better than those 
generated by Pr'^(X\E). Pr'(Xj|pa(Xj), e) — Pr''(Xj|pa(Xj), e) corresponds to the vector 
of first partial derivatives in the direction of the maximum decrease in the error. r]{k) is 
a positive function that determines the learning rate. When rj{k) =0 (lower bound), we 
do not update our importance function. When r]{k) = 1 (upper bound), at each stage we 
discard the old function. The convergence speed is directly related to r]{k). If it is small, 
the convergence will be very slow due to the large number of updating steps needed to 
reach a local minimum. On the other hand, if it is large, convergence rate will be initially 
very fast, but the algorithm will eventually start to oscillate and thus may not reach a 
minimum. There are many papers in the field of neural network learning that discuss how 
to choose the learning rate and let estimated importance function converge quickly to the 
destination function. Any method that can improve learning rate should be applicable to 
this algorithm. Currently, we use the following function proposed by Ritter et al. (1991) 



where a is the initial learning rate and b is the learning rate in the last step. This function 
has been reported to perform well in neural network learning (Ritter et al., 1991). 

3.3 Heuristic Initialization in AIS-BN 

The dimensionality of the problem of Bayesian network inference is equal to the number of 
variables in a network, which in the networks considered in this paper can be very high. 
As a result, the learning space of the optimal importance function is very large. Choice of 
the initial importance function Pr*'(X\E) is an important factor affecting the learning — 
an initial value of the importance function that is close to the optimal importance function 
can greatly affect the speed of convergence. In this section, we present two heuristics that 
help to achieve this goal. 

Due to their explicit encoding of the structure of a decomposable joint probability distri- 
bution, Bayesian networks offer computational advantages compared to finite-dimensional 
integrals. A possible first approximation of the optimal importance function is the prior 
probability distribution over the network variables, Pr(X). We propose an improvement on 
this initialization. We know that the effect of evidence nodes on a node will be attenuated 
as the path length of that node to evidence nodes is increased (Henrion, 1989) and the 
most aff'ected nodes are the direct ancestors of the evidence nodes. Initializing the ICPT 
tables of the parents of the evidence nodes to uniform distributions in our experience im- 
proves the convergence rate. Furthermore, the CPT tables of the parents of an evidence 
node E may be not favorable to the observed state e if the probability of £^ = e without 




(10) 
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any condition is less than a small value, such as Ft{E = e) < 1/(2 • ue), where ue is the 
number of outcomes of node E. Based on this observation, we change the CPT tables of 
the parents of an evidence node E to uniform distributions in our experiment only when 
Pr(£^ = e) < 1/(2 • ue), otherwise we leave them unchanged. This kind of initialization 
involves the knowledge of Ft{E = e), the marginal probability without evidence. Proba- 
bilistic logic sampling (Henrion, 1988) enhanced by Latin hypercube sampling (Cheng Sz 
Druzdzel, 2000b) or quasi-Monte Carlo methods (Cheng & Druzdzel, 2000a) will produce a 
very good estimate of Pr(£^ = e). This is an one-time effort that can be made at the model 
building stage and is worth pursuing to any desired precision. 

Another serious problem related to sampling are extremely small probabilities. Suppose 
there exists a root node with a state s that has the prior probability Fi{s) = 0.0001. Let 
the posterior probability of this state given evidence be Pr(s|E) = 0.8. A simple calculation 
shows that if we update the importance function every 1,000 samples, we can expect to 
hit s only once every 10 updates. Thus s's convergence rate will be very slow. We can 
overcome this problem by setting a threshold 6 and replacing every probability p < 6 in the 
network by 0? At the same time, we subtract {6 — p) from the largest probability in the 
same conditional probability distribution. For example, the value oi 6 = 10//, where I is 
the updating interval, will allow us to sample 10 times more often in the first stage of the 
algorithm. If this state turns out to be more likely (having a large weight), we can increase 
its probability even more in order to converge to the correct answer faster. Considering 
that we should avoid /(X) ^ |9(X) — /• /(X)| in an unimportant region as discussed in 
Section 2.1, we need to make this threshold larger. We have found that the convergence 
rate is quite sensitive to this threshold. Based on our empirical tests, we suggest to use 
6 = 0.04 in networks whose maximum number of outcomes per node does not exceed five. 
A smaller threshold might lead to fast convergence in some cases but slow convergence in 
others. If one threshold does not work, changing it in a specific network will usually improve 
convergence rate. 

3.4 Selection of Parameters 

There are several tunable parameters in the AIS-BN algorithm. We base the choice of 
these parameters on the Central Limit Theorem (CLT). According to CLT, if Zi, Z2, . . . , 
Zn are independent and identically distributed random variables with E{Zi) = Hz and 
Var(Zj) =cr|,2 = l,...,n, then Z = (Zi + ... + Z„)/n is approximately normally distributed 
when n is sufficiently large. Thus, 



lim Pi 




Although this approximation holds when n approaches infinity, CLT is known to be very 
robust and lead to excellent approximations even for small n. The formula of Equation 11 
is an (sr, 5) Relative Approximation, which is an estimate Ji oi fj, that satisfies 

P(fcH >er)<S. 

2. This initialization heuristic was apparently developed independently by Ortiz and Kaelbling (2000). 
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If 6 has been fixed, 

where ^z{z) = e^^'^^'^dx. Since in our samphng problem, /j.^ (corresponding to 

Pr(E) in Figure 2) has been fixed, setting to a smaller value amounts to letting ozj \/n 
be smaller. So, we can adjust the parameters based on az/^/n, which can be estimated 
using Equation 3. It is also the theoretical intuition behind our recommendation oc I/ct*^ 
in Section 3.1. While we expect that this should work well in most networks, no guarantees 
can be given here — there exist always some extreme cases in sampling algorithms in which 
no good estimate of variance can be obtained. 

3.5 A Generalization of AIS-BN: The Problem of Estimating Pr(a|e) 

A typical focus of systems based on Bayesian networks is the posterior probability of various 
outcomes of individual variables given evidence, Pr(a|e). This can be generalized to the 
computation of the posterior probability of a particular instantiation of a set of variables 
given evidence, i.e., Pr(A = a|e). There are two methods that are capable of performing 
this computation. The first method is very efficient at the expense of precision. The second 
method is less efficient, but offers in general better convergence rates. Both methods are 
based on Equation 7. 

The first method reuses the samples generated to estimate Pr(e) in estimating Pr(a, e). 
Estimation of Pr(a, e) amounts to counting the scored sum under the condition A = a. 
The main advantage of this method is its efficiency — we can use the same set of samples 
to estimate the posterior probability of any state of a subset of the network given evidence. 
Its main disadvantage is that the variance of the estimated Pr(a, e) can be large, especially 
when the numerical value of Pr(a|e) is extreme. This method is the most widely used 
approach in the existing stochastic sampling algorithms. 

The second method, used much more rarely (e.g., Cano et al., 1996; Pradhan &; Dagum, 
1996: Dagum & Luby, 1997), calls for estimating Pr(e) and Pr(a, e) separately. After 
estimating Pr(e), an additional call to the algorithm is made for each instantiation a of 
the set of variables of interest A. Pr(a, e) is estimated by sampling the network with the 
set of observations e extended by A = a. The main advantage of this method is that it 
is much better at reducing variance than the first method. Its main disadvantage is the 
computational cost associated with sampling for possibly many combinations of states of 
nodes of interest. 

Cano et al. (1996) suggested a modified version of the second method. Suppose that we 
are interested in the posterior distribution Pr(ai|e) for all possible values ai of A, i = 1, 
2, . . . , k. We can estimate Pr(a.j,e) for each i = 1, . . . , A; separately, and use the value 
Pr(ai, e) as an estimate for Pr(e). The assumption behind this approach is that the 
estimate of Pr(e) will be very accurate because of the large sample from which it is drawn. 
However, even if we can guarantee small variance in every Pr(ai,e), we cannot guarantee 
that their sum will also have a small variance. So, in the AIS-BN algorithm we only use 
the pure form of each of the methods. The algorithm listed in Figure 2 is based on the first 
method when the optional computations in Steps 12 and 13 are performed. An algorithm 
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corresponding to the second method skips the optional steps and calls the basic AIS-BN 
algorithm twice to estimate Pr(e) and Pr(a, e) separately. 

The first method is very attractive because of its simplicity and possible computational 
efficiency. However, as we have shown in Section 2.2, the performance of a sampling algo- 
rithm that uses just one set of samples (as in the first method above) to estimate Pr(a|e) 
will deteriorate if the difference between the optimal importance functions for Pr(a,e) and 
Pr(e) is large. If the main focus of the computation is high accuracy of the posterior proba- 
bility distribution of a small number of nodes, we strongly recommend to use the algorithm 
based on the second method. Also, this algorithm can be easily used to estimate confidence 
intervals of the solution. 

4. Experimental Results 

In this section, we first describe the experimental method used in our tests. Our tests focus 
on the CPCS network, which is one of the largest and most realistic networks available 
and for which we know precisely which nodes are observable. We were, therefore, able 
to generate very realistic test cases. Since the AIS-BN algorithm uses two initialization 
heuristics, we designed an experiment that studies the contribution of each of these two 
heuristics to the performance of the algorithm. To probe the extent of AIS-BN algorithm's 
excellent performance, we test it on several real and large networks. 

4.1 Experimental Method 

We performed empirical tests comparing the AIS-BN algorithm to the likelihood weighting 
(LW) and the self-importance sampling (SIS) algorithms. The two algorithms are basically 
the state of the art general purpose belief updating algorithms. The AA (Dagum et al., 
1995) and the bounded variance (Dagum &; Luby, 1997) algorithms, which were suggested 
by a reviewer, are essentially enhanced special purpose versions of the basic LW algorithm. 
Our implementation of the three algorithms relied on essentially the same code with separate 
functions only when the algorithms differed. It is fair to assume, therefore, that the observed 
differences are purely due to the theoretical differences among the algorithms and not due to 
the efficiency of implementation. In order to make the comparison of the AIS-BN algorithm 
to LW and SIS fair, we used the first method of computation (Section 3.5), i.e., one that 
relies on single sampling rather than calling the basic AIS-BN algorithm twice. 

We measured the accuracy of approximation achieved by the simulation in terms of the 
Mean Square Error {MSE), i.e., square root of the sum of square differences between Pr'(a;jj) 
and Pr(a;.y ), the sampled and the exact marginal probabilities of state j {j = 1,2,..., rii) 
of node i, such that Xi ^ E. More precisely. 



where N is the set of all nodes, E is the set of evidence nodes, and n-i is the number of 
outcomes of node i. In all diagrams, the reported MSE is averaged over 10 runs. We used 
the clustering algorithm (Lauritzen Sz Spiegelhalter, 1988) to compute the gold standard 



MSE 



1 



^ ^(Pr'(a:,,)-Pr(:r,,))2 
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results for our comparisons of the mean square error. We performed all experiments on a 
Pentium II, 333 MHz Windows computer. 

While MSE is not perfect, it is the simplest way of capturing error that lends itself to 
further theoretical analysis. For example, it is possible to derive analytically the idealized 
convergence rate in terms of MSE, which, in turn, can be used to judge the quality of the 
algorithm. MSE has been used in virtually all previous tests of sampling algorithms, which 
allows interested readers to tie the current results to the past studies. A reviewer offered 
an interesting suggestion of using cross-entropy or some other technique that weights small 
changes near zero much more strongly than the equivalent size change in the middle of 
the [0, 1] interval. Such measure would penalize the algorithm for imprecisions of possibly 
several orders of magnitude in very small probabilities. While this idea is interesting, we 
are not aware of any theoretical reasons as to why this measure would make a difference in 
comparisons between AIS-BN, LW and SIS algorithms. The MSE, as we mentioned above, 
will allow us to compare the empirically determined convergence rate to the theoretically 
derived ideal convergence rate. Theoretically, the MSE is inversely proportional to the 
square root of the sample size. 

Since there are several tunable parameters used in the AIS-BN algorithm, we list the 
values of the parameters used in our test: / = 2, 500; i;;^ = for A; < 9 and = I 
otherwise. We stopped the updating process in Step 7 of Figure 2 after k > 10. In other 
words, we used only the samples collected in the last step of the algorithm. The learning 
parameters used in our algorithm are A;max = 10, a = 0.4, and 6 = 0.14 (see Equation 10). 
We used an empirically determined value of the threshold = 0.04 (Section 3.3). We only 
change the CPT tables of the parents of a special evidence node A to uniform distributions 
when Pr(^ = o.) < 1/(2 " i^a)- Some of the parameters are a matter of design decision 
(e.g., the number of samples in our tests), others were chosen empirically. Although we 
have found that these parameters may have different optimal values for different Bayesian 
networks, we used the above values in all our tests of the AIS-BN algorithm described in 
this paper. Since the same set of parameters led to spectacular improvement in accuracy 
in all tested networks, it is fair to say that the superiority of the AIS-BN algorithm to the 
other algorithms is not too sensitive to the values of the parameters. 

For the SIS algorithm, w'' = 1 hy the design of the algorithm. We used I = 2, 500. The 
updating function in Step 7 of Figure 1 is that of (Shwe et al., 1991; Cousins, Chen, & 
Frisse, 1993): 

^^newK^ilP^K^i)^^) — I + k ' 

where Pr{xi\pa{Xi)) is the original sampling distribution. Pi f,y^j,j,g^f{xi\pa{Xi) , e) is an 
equivalent of our ICPT tables estimator based on all currently available information, and 
k is the updating step. 

4.2 Results for the CPCS Network 

The main network used in our tests is a subset of the CPCS (Computer-based Patient Case 
Study) model (Pradhan et al., 1994), a large multiply-connected multi-layer network con- 
sisting of 422 multi- valued nodes and covering a subset of the domain of internal medicine. 
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Among the 422 nodes, 14 nodes describe diseases. 33 nodes describe history and risk fac- 
tors, and the remaining 375 nodes describe various findings related to the diseases. The 
CPCS network is among the largest real networks available to the research community at 
the present time. The CPCS network contains many extreme probabilities, typically on 
the order of 10^^. Our analysis is based on a subset of 179 nodes of the CPCS network, 
created by Max Henrion and Malcolm Pradhan. We used this smaller version in order to 
be able to compute the exact solution for the purpose of measuring approximation error in 
the sampling algorithms. 

The AIS-BN algorithm has some learning overhead. The following comparison of exe- 
cution time vs. number of samples may give the reader an idea of this overhead. Updating 
the CPCS network with 20 evidence nodes on our system takes the AIS-BN algorithm a 
total of 8.4 seconds to learn. It generates subsequently 3,640 samples per second, while the 
SIS algorithm generates 2,631 samples per second, and the LW algorithm generates 4,167 
samples per second. In order to remain conservative towards the AIS-BN algorithm, in all 
our experiments we fixed the execution time of the algorithms (our limit was 60 seconds) 
rather than the number of samples. In the CPCS network with 20 evidence nodes, in 60 
seconds, AIS-BN generates about 188,000 samples, SIS generates about 158,000 samples 
and LW generates about 250,000 samples. 




1E-40 1E-34 1E-28 1E-22 1E-16 1E-10 

Probability of evidence 



Figure 4: The probability distribution of evidence Pr(E = e) in our experiments. 

We generated a total of 75 test cases consisting of five sequences of 15 test cases each. We 
ran each test case 10 times, each time with a different setting of the random number seed. 
Each sequence had a progressively higher number of evidence nodes: 15, 20, 25, 30, and 
35 evidence nodes respectively. The evidence nodes were chosen randomly (equiprobable 
sampling without replacement) from those nodes that described various plausible medical 
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findings. Almost all of these nodes were leaf nodes in the network. We believe that this 
constituted very realistic test cases for the algorithms. The distribution of the prior prob- 
ability of evidence, Pr(E = e), across all test runs of our experiments is shown in Figure 4. 
The least likely evidence was 5.54 x 10~^^, the most likely evidence was 1.37 x 10~^, and 
the median was 7 x 10^^^. 
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Figure 5: A typical plot of convergence of the tested sampling algorithms in our experiments 
— Mean Square Error as a function of the execution time for a subset of the 
CPCS network with 20 evidence nodes chosen randomly among plausible medical 
observations (Pr(E = e) = 3.33 x 10"^^ in this particular case) for the AIS-BN, 
the SIS, and the LW algorithms. The curve for the AIS-BN algorithm is very 
close to the horizontal axis. 



Figures 5 and 6 show a typical plot of convergence of the tested sampling algorithms in 
our experiments. The case illustrated involves updating the CPCS network with 20 evidence 
nodes. We plot the MSE after the initial 15 seconds during which the algorithms start 
converging. In particular, the learning step of the AIS-BN algorithm is usually completed 
within the first 9 seconds. We ran the three algorithms in this case for 150 seconds rather 
than the 60 seconds in the actual experiment in order to be able to observe a wider range of 
convergence. The plot of the MSE for the AIS-BN algorithm almost touches the X axis in 
Figure 5. Figure 6 shows the same plot in a finer scale in order to show more detail in the 
AIS-BN convergence curve. It is clear that the AIS-BN algorithm dramatically improves 
the convergence rate. We can also see that the results of AIS-BN converge to exact results 
very fast as the sampling time increases. In the case captured in Figures 5 and 6, a tenfold 
increase in the sampling time (after subtracting the overhead for the AIS-BN algorithm, it 
corresponds to a 21.5-fold increase in the number of samples) results in a 4.55-fold decrease 
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Figure 6: The lower part of the plot of Figure 5 showing the convergence of the AIS-BN 
algorithm to correct posterior probabilities. 



of the MSE (to MSE< 0.00048). The observed convergence of both SIS and LW algorithms 
was poor. A tenfold increase in sampling time had practically no effect on accuracy. Please 
note that this is a very typical case observed in our experiments. 





Original CPT 


Exact ICPT 


Learned ICPT 


"Absent" 


0.99631 


0.0037 


0.015 


"Mild" 


0.00183 


0.1560 


0.164 


"Moderate" 


0.00093 


0.1190 


0.131 


"Severe" 


0.00093 


0.7213 


0.690 



Table 1: A fragment of the conditional probability table of a node of the CPCS network 
(node gasAcute, parents hepAcute=Mild and wbcTotTho= False) in Figure 6. 

Figure 7 illustrates the ICPT learning process of the AIS-BN algorithm for the sample 
case shown in Figure 6. The displayed conditional probabilities belong to the node gasAcute 
which is a parent of two evidence nodes, diflnfGasMuc and abdPaiExaMea. The node 
gasAcute has four states: "absent," "mild," "moderate," and "severe", and two parents. 
We randomly chose a combination of its parents' states as our displayed configuration. The 
original CPT for this configuration without evidence, the exact ICPT with evidence and 
the learned ICPT with evidence are summarized numerically in Table 1. Figure 7 illustrates 
that the learned importance conditional probabilities begin to converge to the exact results 
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Figure 7: Convergence of the conditional probabilities during the example run of the AIS- 
BN algorithm captured in Figure 6. The displayed fragment of the conditional 
probability table belongs to node gasAcute which is a parent of one of the evidence 
nodes. 



stably after three updating steps. The learned probabilities in Step 10 are close to the 
exact results. In this example, the difference between PT{xi\pa{Xi),e) and Pr(a;i|pa(Xi)) is 
very large. Sampling from Pr(a;j|pa(Xj)) instead of Pr(a;j|pa(Xj), e) would introduce large 
variance into our results. 





AIS-BN 


SIS 


LW 


A* 


0.00082 


0.110 


0.148 


a 


0.00022 


0.076 


0.093 


min 


0.00049 


0.0016 


0.0031 


median 


0.00078 


0.105 


0.154 


max 


0.00184 


0.316 


0.343 



Table 2: Summary of the simulation results for all of the 75 simulation cases on the CPCS 
network. Figure 8 shows each of the 75 cases graphically. 

Figure 8 shows the MSE for all 75 test cases in our experiments with the summary 
statistics in Table 2. A paired one-tailed t-test resulted in statistically highly significant 
differences between the AIS-BN and SIS algorithms {p < 3.1 x 10^^°), and also between 
the SIS and LW algorithms {p < 1.7 x 10~^). As far as the magnitude of difference is 
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Figure 8: Performance of the AIS-BN, SIS, and LW algorithms: Mean Square Error for 
each of the 75 individual test cases plotted against the probability of evidence. 
The sampling time is 60 seconds. 



concerned, AIS-BN was significantly better than SIS. SIS was better than LW, but the 
difference was small. The mean MSEs of SIS and LW algorithms were both greater than 
0.1, which suggests that neither of these algorithms is suitable for large Bayesian networks. 

The graph in Figure 9 shows the MSE ratio between the AIS-BN and SIS algorithms. 
We can see that the percentage of the cases whose ratio was greater than 100 (two orders 
of magnitude improvement!) is 60%. In other words, we obtained two orders of magnitude 
improvement in MSE in more than half of the cases. In 80% cases, the ratio was greater 
than 50. The smallest ratio in our experiments was 2.67, which happened when posterior 
probabilities were dominated by the prior probabilities. In that case, even though the LW 
and SIS algorithms converged very fast, their MSE was still far larger than that of AIS-BN. 

Our next experiment aimed at showing how close the AIS-BN algorithm can approach 
the best possible sampling results. If we know the optimal importance sampling function, 
the convergence of the AIS-BN algorithm should be the same as that of forward sampling 
without evidence. In other words, the results of the probabilistic logic sampling algorithm 
without evidence approach the limit of how well stochastic sampling can perform. We ran 
the logic sampling algorithm on the CPCS network without evidence mimicking the test 
runs of the AIS-BN algorithm, i.e., 5 blocks of 15 runs, each repeated 10 times with a 
different random number seed. The number of samples generated was equal to the average 
number of samples generated by the AIS-BN algorithm for each series of 15 test runs. 
We obtained the average MSE n = 0.00057, with a = 0.000025, min = 0.00052, and 
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Figure 9: The ratio of MSE between SIS and AIS-BN versus percentage. 



max = 0.00065. The best results should be around this range. From Table 2, we can 
see that the minimum MSE for the AIS-BN algorithm was 0.00049, within the range of 
the optimal result. The mean MSE in AIS-BN is 0.00082, not too far from the optimal 
results. The standard deviation, cr, is significantly larger in the AIS-BN algorithm, but 
this is understandable given that the process of learning the optimal importance function is 
heuristic in nature. It is not difficult to understand that there exist a difference between the 
AIS-BN results and the optimal results. First, the AIS-BN algorithm in our tests updated 
the sampling distribution only 10 times, which may be too few times to let it converge 
to the optimal importance distribution. Second, even if the algorithm has converged to 
the optimal importance distribution, the sampling algorithm will still let the parameter 
oscillate around this distribution and there will be always small differences between the two 
distributions. 

Figure 10 shows the convergence rate for all tested cases for a four- fold increase in 
sampling time (between 15 and 60 seconds). We adjusted the convergence ratio of the 
AIS-BN algorithm by dividing it by a constant. According to Equation 3, the theoretically 
expected convergence ratio for a four-fold increase in the number of samples should be 
around two. There are about 96% cases among the AIS-BN runs whose ratio lays in 
the interval (1.75, 2.25], in a sharp contrast to 11% and 13% cases in the SIS and LW 
algorithms. The ratios of the remaining 4% cases in AIS-BN lay in the interval [2.25, 2.5]. 
In the SIS and LW algorithms, the percentage of cases whose ratio were smaller than 1.5 
was 71% and 77% respectively. Less than 1.5 means that the number of samples was too 
small to estimate variance and the results cannot be trusted. The ratio greater than 2.25 
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Figure 10: The distribution of the convergence ratio of the AIS-BN, SIS, and LW algo- 
rithms when the number of samples increases four times. 



means possibly that 60 seconds was long enough to estimate the variance, but 15 seconds 
was too short. 

4.3 The Role of AIS-BN Heuristics in Performance Improvement 

Prom the above experimental results we can see that the AIS-BN algorithm can improve 
the sampling performance significantly. Our next series of tests focused on studying the role 
of the two AIS-BN initialization heuristics. The first is initializing the ICPT tables of the 
parents of evidence to uniform distributions, denoted by U. The second is adjusting small 
probabilities, denoted by S. We denote AIS-BN without any heuristic initialization method 
to be the AIS algorithm. AIS+ C/+ 5 equals AIS-BN. We compared the following versions 
of the algorithms: SIS, AIS. SIS+U. AIS+U, SIS+S, AIS+5, SIS+U+S, AIS+U+S. All 
algorithms with SIS used the same number of samples as SIS. All algorithms with AIS used 
the same number of samples as AIS-BN. We tested these algorithms on the same 75 test 
cases used in the previous experiment. Figure 11 shows the MSE for each of the sampling 
algorithms with the summary statistics in Table 3. Even though the AIS algorithm is better 
than the SIS algorithm, the difference is not as large as in case of the AIS+ U, AIS+5', and 
AIS-BN algorithms. It seems that heuristic initialization methods help much. The results 
for the SIS+5, SIS+C/, SIS+ CZ+S* algorithms suggest that although heuristic initialization 
methods can improve performance, they alone cannot improve too much. It is fair to say 
that significant performance improvement in the AIS-BN algorithm is coming from the 
combination of AIS with heuristic methods, not any method alone. It is not difficult to 
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understand that, as only with good heuristic initialization methods is it possible to let the 
learning process quickly exit oscillation areas. Although both S and U methods alone can 
improve the performance, the improvement is moderate compared to the combination of 
the two. 
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Figure 11: A comparison of different algorithms in the CPCS network. Each bar is based 
on 75 test cases. The dotted bar shows the MSE for the SIS algorithm while 
the gray bar shows the MSE for the AIS algorithm. 
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Table 3: Summary of the simulation results for different algorithms in the CPCS network. 



4.4 Results for Other Networks 

In order to make sure that the AIS-BN algorithm performs well in general, we tested it on 
two other large networks. 

The first network that we used in our tests is the PathFinder network (Heckerman 
et al., 1990), which is the core element of an expert system that assists surgical pathologists 
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with the diagnosis of lymph- node diseases. There are two versions of this network. We used 
the larger version, consisting of 135 nodes. In contrast to the CPCS network, PathFinder 
contains many conditional probabilities that are equal to 1, which reflects deterministic 
relationships in certain settings. To make the sampling challenging, we randomly selected 
20 evidence nodes from among the leaf nodes. Each of these was an observable node (David 
Heckerman, personal communication). We verified in each case that the probability of so 
selected evidence was not equal to zero. 

We fixed the execution time of the algorithms to be 60 seconds. The learning overhead 
for the AIS-BN algorithm in the PathFinder network was about 3.5 seconds. In 60 
seconds, AIS-BN generated about 366,000 samples, SIS generated about 250,000 samples 
and LW generated about 2,700,000 samples. The reason why LW could generate more than 
10 times as many samples as SIS within the same amount of time is that the LW algorithm 
terminates sample generation at a very early stage in many samples, when the weight of a 
sample becomes zero. This is a result of determinism in the probability tables, mentioned 
above. We will see that LW benefits greatly from generating more samples. The other 
parameters used in AIS-BN were the same as those used in the CPCS network. 

We tested 20 cases, each with randomly selected 20 evidence nodes. The reported MSE 
for each case was averaged over 10 runs. Some of the runs of the SIS and LW algorithms did 
not manage to generate any effective samples (the weight score sum was equal to zero). SIS 
had only 75% effective runs and LW had only 89% effective runs, which means that in some 
runs SIS and LW were unable to yield any information about the posterior distributions. 
In all those cases, we discarded the run and only averaged over the effective runs. All 
runs in the AIS-BN algorithm were effective. We report our experimental results with the 
summary statistics in Table 4. From these data, we can see that the AIS-BN algorithm 
is still significantly better than the SIS and LW algorithms. Since the LW algorithm can 
generate more than ten times the number of samples than the SIS algorithm, its performance 
is better than that of the SIS algorithm. 
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Table 4: Summary of the simulation results for all of the 20 simulation cases on the 
PathFinder network. 



The second network that we tested was one of the ANDES networks (Conati et al., 
1997). ANDES is an intelligent tutoring system for classical Newtonian physics that is 
being developed by a team of researchers at the Learning Research and Development Center 
at the University of Pittsburgh and researchers at the United States Naval Academy. The 
student model in ANDES uses a Bayesian network to do long-term knowledge assessment. 
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plan recognition, and prediction of students' actions during problem solving. We selected 
the largest ANDES network that was available to us, consisting of 223 nodes. 

In contrast to the previous two networks, the depth of the ANDES network was sig- 
nificantly larger and so was its connectivity. There were only 22 leaf nodes. It is quite 
predictable that this kind of networks will pose difficulties to learning. We selected 20 
evidence nodes randomly from the potential evidence nodes and tested 20 cases. All pa- 
rameters were the same as those used in the CPCS network. We fixed the execution time 
of the algorithms to be 60 seconds. The learning overhead for the AIS-BN algorithm in 
the ANDES network was 13.4 seconds. In 60 seconds, AIS-BN generated about 114,000 
samples, SIS generated about 98,000 samples and LW generated about 180,000 samples. 
In this network, LW still can generate almost two times the number of samples generated 
by the SIS algorithm. 

We report our experimental results with the summary statistics in Table 5. The results 
show that also in the ANDES network the AIS-BN algorithm was significantly better than 
the SIS and LW algorithms. Since LW generated almost two times the number of samples 
that were generated by the SIS algorithm, its performance was better than that of the SIS 
algorithm. 
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Table 5: Summary of the simulation results for all of the 20 simulation cases on the ANDES 
network. 



While the AIS-BN algorithm is on the average an order of magnitude more precise 
than the other two algorithms, the performance improvement is smaller than in the other 
two networks. The reason why the performance improvement of the AIS-BN algorithm 
over the SIS and LW algorithms in the ANDES network is smaller compared to that in 
the CPCS and PathFinder networks is that: (1) The ANDES network used in our tests 
was apparently not challenging enough for sampling algorithms in general. In the ANDES 
network, SIS and LW also can perform well in some cases. The minimum MSE of SIS and 
LW in our tested cases is almost the same as that of AIS-BN. (2) The number of samples 
generated by AIS-BN in this network is significantly smaller than that in the previous 
two networks and AIS-BN needs more time to learn. Although increasing the number of 
samples will improve the performance of all three algorithms, it improves the performance 
of AIS-BN more since the convergence ratio of the AIS-BN algorithm is usually larger than 
that of SIS and LW (see Figure 10). (3) The parameters that we used in this network were 
tuned for the CPCS network. (4) The large depth and fewer leaf nodes of the ANDES 
network pose some difficulties to learning. 
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5. Discussion 

There is a fundamental trade-off in the AIS-BN algorithm between the time spent on 
learning the importance function and the time spent on sampling. Our current approach, 
which we believe to be reasonable, is to stop learning at the point when the importance 
function is good enough. In our experiments we stopped learning after 10 iterations. 

There are several ways of improving the initialization of the conditional probability 
tables at the outset of the AIS-BN algorithm. In the current version of the algorithm, we 
initialize the ICPT table of every parent N of an evidence node E (N E Fa{E), E E E) 
to the uniform distribution when Pr(£^ = e) < 1/(2 ■ ng). This can be improved further. 
We can extend the initialization to those nodes that are severely affected by the evidence. 
They can be identified by examining the network structure and local CPTs. 

We can view the learning process of the AIS-BN algorithm as a network rebuilding 
process. The algorithm constructs a new network whose structure is the same as the original 
network (except that we delete the evidence nodes and corresponding arcs). The constructed 
network models the joint probability distribution p(X\E) in Equation 8, which approaches 
the optimal importance function. We use the learned p' to approximate this distribution. 
If p' approximates Pr(X|E) accurately enough, we can use this new network to solve other 
approximate tasks, such as the problem of computing the Maximum A-Posterior assignment 
(MAP) (Pearl, 1988), finding k most likely scenarios (Seroussi & Golmard, 1994), etc. A 
large advantage of this approach is that we can solve each of these problems as if the network 
had no evidence nodes. 

We know that Markov blanket scoring can improve convergence rates in some sampling 
algorithms (Shwe &; Cooper, 1991). It may also be applied to the AIS-BN algorithm to 
improve its convergence rate. According to Property 4 (Section 2.1), any technique that can 
reduce the variance crp will reduce the variance of Pr(e) and correspondingly improve 

the sampling performance. Since the variance of stratified sampling (Rubinstein, 1981) is 
never much worse than that of random sampling, and can be much better, it can improve the 
convergence rate. We expect some other variance reduction methods in statistics, such as: 
{i) the expected value of a random variable; (m) antithetic variants correlations (stratified 
sampling, Latin hypercube sampling, etc.); and {in) systematic sampling, will also improve 
the sampling performance. 

Current learning algorithm used a simple approach. Some heuristic learning methods, 
such as adjusting learning rates according to changes of the error (Jacobs, 1988), should 
also be applicable to our algorithm. There are several tunable parameters in the AIS-BN 
algorithm. Finding the optimal values of these parameters for any given network is another 
interesting research topic. 

It is worth observing that the plots presented in Figure 8 are fairly flat. In other words, 
in our tests the convergence of the sampling algorithms did not depend too strongly on the 
probability of evidence. This seems to contradict the common belief that forward sampling 
schemes suffer from unlikely evidence. AIS-BN for one shows a fairly flat plot. The 
convergence of the SIS and LW algorithms seems to decrease slightly with unlikely evidence. 
It is possible that all three algorithms will perform much worse when the probability of 
evidence drops below some threshold value, which our tests failed to approach. Until this 
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relationship has been studied carefully, we conjecture that the probability of evidence is not 
a good measure of difficulty of approximate inference. 

Given that the problem of approximating probabilistic inference is NP-hard, there exist 
networks that will be challenging for any algorithm and we have no doubt that even the 
AIS-BN algorithm will perform poorly on them. To the day, we have not found such 
networks. There is one characteristic of networks that may be challenging to the AIS-BN 
algorithm. In general, when the number of parameters that need to be learned by the AIS- 
BN algorithm increases, its performance will deteriorate. Nodes with many parents, for 
example, are challenging to the AIS-BN learning algorithm, as it has to update the ICPT 
tables under all combinations of the parent nodes. It is possible that conditional probability 
distributions with causal independence properties, such as Noisy-OR distributions (Pearl, 
1988; Henrion, 1989; Diez, 1993; Srinivas, 1993; Heckerman &; Breese, 1994), common in 
very large practical networks, can be treated differently and lead to considerable savings in 
the learning time. 

One direction of testing approximate algorithms, suggested to us by a reviewer, is to use 
very large networks for which exact solution cannot be computed at all. In this case, one 
can try to infer from the difference in variance at various stages of the algorithm whether 
it is converging or not. This is a very interesting idea that is worth exploring, especially 
when combined with theoretical work on stopping criteria in the line of the work of Dagum 
and Luby (1997). 

6. Conclusion 

Computational complexity remains a major problem in application of probability theory 
and decision theory in knowledge-based systems. It is important to develop schemes that 
improve the performance of updating algorithms — even though the theoretically demon- 
strated worst case will remain NP-hard, many practical cases may become tractable. 

In this paper, we studied importance sampling in Bayesian networks. After reviewing 
the most important theoretical results related to importance sampling in finite-dimensional 
integrals, we proposed a new algorithm for importance sampling in Bayesian networks that 
we call adaptive importance sampling (AIS-BN). While the process of learning the optimal 
importance function for the AIS-BN algorithm is computationally intractable, based on the 
theory of importance sampling in finite-dimensional integrals we proposed several heuristics 
that seem to work very well in practice. We proposed heuristic methods for initializing the 
importance function that we have shown to accelerate the learning process, a smooth learn- 
ing method for updating importance function using the structural advantages of Bayesian 
networks, and a dynamic weighting function for combining samples from different stages 
of the algorithm. All these methods help the AIS-BN algorithm to get fairly accurate 
estimates of the posterior probabilities in a limited time. Of the two applied heuristics, 
adjustment of small probabilities, seems to lead to the largest improvement in performance, 
although the largest decrease in MSE is achieved by a combination of the two heuristics 
with the AIS-BN algorithm. 

The AIS-BN algorithm can lead to a dramatic improvement in the convergence rates in 
large Bayesian networks with evidence compared to the existing state of the art algorithms. 
We compared the performance of the AIS-BN algorithm to the performance of likelihood 
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weighting and self-importance sampling on a large practical model, the CPCS network, 
with evidence as unlikely as 5.54 x lO^'^'^ and typically 7 x 1.0^^^. In our experiments, we 
observed that the AIS-BN algorithm was always better than likelihood weighting and self- 
importance sampling and in over 60% of the cases it reached over two orders of magnitude 
improvement in accuracy. Tests performed on the other two networks, PathFinder and 
ANDES, yielded similar results. 

Although there may exist other approximate algorithms that will prove superior to AIS- 
BN in networks with special structure or distribution, the AIS-BN algorithm is simple 
and robust for general evidential reasoning problems in large multiply-connected Bayesian 
networks. 
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